[10 pts]
Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen. One way to overcome this problem is not to use the entire data set when training a learner: some of the data is removed before training begins. Then, when training is done, the data that was removed can be used to test the performance of the learned model on new data. This is the basic idea for a whole class of model evaluation methods called cross validation.
The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to compute. However, its evaluation can have a high variance: the evaluation may depend heavily on which data points end up in the training set and which end up in the test set, and thus may be significantly different depending on how the division is made.
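As a quick sketch of the holdout method, scikit-learn's `train_test_split` performs the split described above; the toy `X` and `y` below are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data, invented for illustration
X = np.arange(20).reshape(-1, 1)
y = X.ravel() % 2

# hold out 30% of the rows; the model is trained on the rest and
# evaluated only on rows it has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```

Because a single random split drives the whole evaluation, rerunning with a different `random_state` can give a noticeably different test error, which is exactly the high-variance problem noted above.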
K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.
%%html
<img src="k_fold.png" width=900 height=600>
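The k-fold scheme pictured above can be sketched with scikit-learn's `KFold`; the 12-point data set is made up, but the invariant it checks (every point is tested exactly once) holds for any data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)          # 12 toy data points
kf = KFold(n_splits=4, shuffle=True, random_state=0)

test_counts = np.zeros(len(X), dtype=int)
for train_idx, test_idx in kf.split(X):
    # each fold trains on 9 points and tests on the held-out 3
    test_counts[test_idx] += 1
```

Averaging the per-fold errors over the k folds gives the cross-validation estimate described in the text.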
Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute. Fortunately, locally weighted learners can make LOO predictions just as easily as they make regular predictions. That means computing the LOO-XVE takes no more time than computing the residual error and it is a much better way to evaluate models. We will see shortly that Vizier relies heavily on LOO-XVE to choose its metacodes.
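Leave-one-out can be sketched the same way; the noiseless line below is invented so the LOO error comes out essentially zero, making the mechanics easy to check:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

# toy noiseless data: y is an exact line, so every LOO prediction is exact
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # N = 10 separate fits, each leaving out exactly one point
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(abs(model.predict(X[test_idx])[0] - y[test_idx][0]))
loo_mae = np.mean(errors)
```

Note this brute-force version refits N times; the shortcut the text mentions for locally weighted learners avoids that cost.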
Advantages
(a) Better prediction results
(b) Avoids overfitting to training data in linear models such as linear regression
(c) All of the data is used for both training and testing. This is especially useful for a small dataset, where a small number of samples may produce a result by chance that does not hold on newer samples. By testing with a cross validation methodology, the output is tested thoroughly and not left to chance. E.g., with 5-fold cross-validation on a dataset of 100 examples, we build 5 different models and obtain predictions for all of our data: each instance is predicted by a model that did not see it during training, so in total 100 examples serve as test data. For a 10-class problem that averages 10 examples per class, which is much better than just 2. After evaluating the learning algorithm, we can train a final model on all the data: if the 5 models had similar performance on their different training sets, we can assume that training on all the data will give similar performance.
Disadvantages
(a) In an ideal world, cross validation yields meaningful and accurate results. However, the world is not perfect: you never know what kind of data the model might encounter in the future.
(b) Usually, in predictive modelling, the structure you study evolves over time. Hence, you can experience differences between the training and validation sets. Consider a model that predicts stock values, trained on data from the previous five years: it may not be realistic to expect accurate predictions over the next five-year period.
(c) Here is another example where the limitation of the cross validation process comes to the fore. Suppose you develop a model for predicting an individual's risk of suffering from a particular ailment using data from one population; cross validation scores computed on that data say nothing about how the model will perform on a population with different characteristics.
[Please refer to our course textbook, The Elements of Statistical Learning, for this] [5 Pts]
Likelihood-based regression models, such as the normal linear regression model and the linear logistic model, assume a linear form for the covariates $ X_1, X_2, ... X_p $. When modeling real-world scenarios, the linearity assumption may not hold. Generalized additive methods may be used to identify and characterize non-linear regression/covariate effects.
Generalized Additive Models (GAMs) replace the linear form $\Sigma\beta_jX_j $ with a sum of smooth functions $\Sigma s_j(X_j) $
In the regression setting, a generalized additive model has the form:
$ E(Y|X_1 , X_2 , ...,X_p)= \alpha + f_1(X_1)+f_2(X_2)+ ... +f_p(X_p) $
$X_1 , X_2 , ...,X_p $ represent predictors and $Y$ is the outcome; the $f_j$ are nonparametric functions.
Each function is fitted using a scatterplot smoother (e.g., a cubic smoothing spline or kernel smoother), and all p functions are estimated simultaneously.
This technique can be viewed as an empirical method of maximizing the expected log-likelihood, or equivalently minimizing the Kullback-Leibler distance to the true model.
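Since each $f_j$ is estimated by smoothing partial residuals, the backfitting procedure behind a GAM can be sketched in plain NumPy. This is a rough illustration, not the textbook's algorithm verbatim: the running-mean smoother below is a crude stand-in for a cubic smoothing spline, and all data and names are invented.

```python
import numpy as np

def running_mean_smoother(x, r, window=15):
    # sort by x, smooth the residuals with a running mean,
    # then map the smoothed values back to the original row order
    order = np.argsort(x)
    smoothed = np.convolve(r[order], np.ones(window) / window, mode='same')
    out = np.empty_like(smoothed)
    out[order] = smoothed
    return out

def backfit(X, y, n_iter=10):
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove the fit of every other f_k
            partial = y - alpha - (f.sum(axis=1) - f[:, j])
            f[:, j] = running_mean_smoother(X[:, j], partial)
            f[:, j] -= f[:, j].mean()      # center each f_j for identifiability
    return alpha, f

# synthetic additive data: y = sin(2*x1) + x2^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 300)
alpha, f = backfit(X, y)
fitted = alpha + f.sum(axis=1)
```

The centering step enforces the usual GAM identifiability constraint that each $f_j$ averages to zero, leaving the intercept $\alpha$ to carry the overall mean.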
Decision trees differ from the rest of the classification methods in the way they generate the decision boundaries, i.e. the lines that are drawn to separate different classes. Decision Trees bisect the space into smaller and smaller regions, whereas Logistic Regression fits a single line to divide the space exactly into two. Of course for higher-dimensional data, these lines would generalize to planes and hyperplanes. A single linear boundary can sometimes be limiting for Logistic Regression. In this example where the two classes are separated by a decidedly non-linear boundary, we see that trees can better capture the division, leading to superior classification performance. However, when classes are not well-separated, trees are susceptible to overfitting the training data, so that Logistic Regression’s simple linear boundary generalizes better.
%%html
<img src="decision_boundary_tree.png"><img src="decision_boundary_log.png">
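The comparison in the figures can be reproduced on synthetic data; `make_circles` produces exactly the kind of circular boundary where a tree's rectangular splits beat a single linear boundary (dataset and split are illustrative choices, not from the assignment):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# two classes separated by a circular, decidedly non-linear boundary
X, y = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

log_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
```

On data like this the tree's test accuracy is far above logistic regression's, which hovers near chance; on linearly separable but noisy data the comparison can easily reverse, as the text warns.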
Different Classification Techniques:
1) Logic Based
- Decision Trees
- Learning Set of Rules
2) Perceptron Based
- Neural Networks
3) Statistical
- Bayesian Networks
- Instance based learning
- kNN
4) Support Vector Machines
How decision trees work:
The decision tree learning algorithm recursively learns the tree as follows: starting from the full training set at the root, pick the attribute and split point that best separates the classes (e.g. by information gain or Gini impurity), partition the data by that split, and recurse on each partition until a stopping criterion is met (a pure node, a maximum depth, or a minimum number of samples), at which point the node becomes a leaf predicting the majority class.
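A toy sketch of this recursive procedure (a CART-style learner in plain NumPy; all names and the demo data are invented, and this is not sklearn's implementation):

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # exhaustive search over (feature, threshold) pairs
    best_j, best_t, best_imp = None, None, gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if imp < best_imp:
                best_j, best_t, best_imp = j, t, imp
    return best_j, best_t

def grow(X, y, depth=0, max_depth=3):
    j, t = best_split(X, y)
    if j is None or depth == max_depth:
        vals, counts = np.unique(y, return_counts=True)
        return vals[np.argmax(counts)]      # leaf: majority class
    mask = X[:, j] <= t
    return (j, t,
            grow(X[mask], y[mask], depth + 1, max_depth),
            grow(X[~mask], y[~mask], depth + 1, max_depth))

def predict_one(node, x):
    # walk from the root down to a leaf
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

# toy data: the class changes at x = 1.5
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
tree = grow(X_demo, y_demo)
```

The `max_depth` and implicit "pure node" checks play the role of the stopping criteria; real implementations add `min_samples_split`-style controls like the sklearn code later in this notebook.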
Advantages
Trees are very inexpensive at test time
Decision trees can handle both nominal and numerical attributes.
Disadvantages
Categorical Predictors When splitting a predictor having q possible unordered values, there are $2^{q-1} - 1$ possible partitions of the q values into two groups, and the computations become prohibitive for large q. However, with a 0-1 outcome, this computation simplifies. We order the predictor classes according to the proportion falling in outcome class 1. Then we split this predictor as if it were an ordered predictor. One can show this gives the optimal split, in terms of cross-entropy or Gini index, among all possible $2^{q-1} - 1$ splits. This result also holds for a quantitative outcome and square error loss—the categories are ordered by increasing mean of the outcome.
For multicategory outcomes, no such simplifications are possible, although various approximations have been proposed (Loh and Vanichsetakul, 1988). The partitioning algorithm tends to favor categorical predictors with many levels q; the number of partitions grows exponentially in q, and the more choices we have, the more likely we can find a good one for the data at hand. This can lead to severe overfitting if q is large, and such variables should be avoided.
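The ordering trick for a 0-1 outcome can be illustrated directly; the categorical data below is invented:

```python
import numpy as np

# invented example: a categorical predictor with q = 4 levels, 0/1 outcome
cats = np.array(list("abcdabcdab"))
y = np.array([1, 0, 1, 0, 1, 1, 0, 0, 0, 1])

levels = np.unique(cats)
prop1 = {c: y[cats == c].mean() for c in levels}       # P(Y = 1 | level)
order = sorted(levels, key=lambda c: prop1[c])
# with the q levels ordered by prop1, only q - 1 ordered splits must be
# checked instead of the 2^(q-1) - 1 = 7 subset partitions
```

Once the levels are sorted by their class-1 proportion, the predictor is split exactly as an ordered one, and the result quoted above guarantees this finds the optimal subset split.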
Loss Matrix In classification problems, the consequences of misclassifying observations are more serious in some classes than others. For example, it is probably worse to predict that a person will not have a heart attack when he/she actually will, than vice versa. To account for this, we define a K × K loss matrix L, with $L_{kk′}$ being the loss incurred for classifying a class k observation as class k′. Typically no loss is incurred for correct classifications, that is, $L_{kk} = 0 \; \forall k$. To incorporate the losses into the modeling process, we could modify the Gini index to $\Sigma_{k\neq k'} L_{kk'}\hat{p}_{mk}\hat{p}_{mk'}$; this would be the expected loss incurred by the randomized rule. This works for the multiclass case, but in the two-class case has no effect, since the coefficient of $\hat{p}_{mk}\hat{p}_{mk'}$ is $L_{kk′} + L_{k'k}$. For two classes a better approach is to weight the observations in class k by $L_{kk′}$. This can be used in the multiclass case only if, as a function of k, $L_{kk′}$ does not depend on k′. Observation weighting can be used with the deviance as well. The effect of observation weighting is to alter the prior probability on the classes. In a terminal node, the empirical Bayes rule implies that we classify to class $k(m) = \mathrm{argmin}_k \Sigma_l L_{lk}\hat{p}_{ml}$.
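A small numeric illustration of the empirical Bayes rule $k(m) = \mathrm{argmin}_k \Sigma_l L_{lk}\hat{p}_{ml}$; the loss matrix and node proportions below are invented:

```python
import numpy as np

# invented 2x2 loss matrix: predicting "no heart attack" for someone who
# will have one (true class 1 -> predicted 0) costs 5; a false alarm
# costs 1; correct predictions cost 0
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])
p_m = np.array([0.7, 0.3])     # node proportions p_mk for classes 0 and 1

expected_loss = p_m @ L        # entry k is sum_l L[l, k] * p_m[l]
k_hat = int(np.argmin(expected_loss))
# k_hat is 1: the costly class wins even though class 0 is the majority
```

With unit losses the rule reduces to majority vote; the asymmetric loss flips the decision toward the class that is expensive to miss.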
Instability of Trees One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it. One can alleviate this to some degree by trying to use a more stable split criterion, but the inherent instability is not removed. It is the price to be paid for estimating a simple, tree-based structure from the data. Bagging (Section 8.7) averages many trees to reduce this variance.
Lack of Smoothness Another limitation of trees is the lack of smoothness of the prediction surface. In classification with 0/1 loss, this doesn’t hurt much, since bias in estimation of the class probabilities has a limited effect. However, this can degrade performance in the regression setting, where we would normally expect the underlying function to be smooth.
Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons:
Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Archetypal cases for the application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features, and a few tens to hundreds of samples.
With subset selection we retain only a subset of the variables, and eliminate the rest from the model.
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model.
There are a number of different strategies for choosing the subset:
- Best subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares.
Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate.
Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score. Backward selection can only be used when N > p, while forward stepwise can always be used.
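Forward-stepwise selection by residual sum of squares can be sketched with NumPy least squares (illustrative only, without the QR-updating optimization the text mentions; the synthetic data is invented):

```python
import numpy as np

def forward_stepwise(X, y, k):
    # greedily add the predictor that most reduces the residual sum of squares
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in remaining:
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta = np.linalg.lstsq(A, y, rcond=None)[0]
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# synthetic data where only columns 0 and 2 matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 2] + rng.normal(0, 0.1, 100)
chosen = forward_stepwise(X_demo, y_demo, 2)
```

Backward-stepwise would run the same loop in reverse, dropping the variable whose removal increases the RSS the least (equivalently, the smallest Z-score).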
Subset selection is thus different from cross validation, which aims at estimating the error by choosing different collections of records and testing the performance of the model using different parts of the dataset for training and testing.
Example
Working on the music dataset, subset selection would be used to decide which features to include in the analysis, and cross validation would be used to evaluate a model consisting of those features with varying training and testing splits, to determine how well the model generalizes across different divisions of the data.
Working on an example with the prostate cancer data: cross-validation works by dividing the training data randomly into ten equal parts. The learning method is fit—for a range of values of the complexity parameter—to nine-tenths of the data, and the prediction error is computed on the remaining one-tenth. This is done in turn for each one-tenth of the data, and the ten prediction error estimates are averaged. From this we obtain an estimated prediction error curve as a function of the complexity parameter. Note that we have already divided these data into a training set of size 67 and a test set of size 30. Cross-validation is applied to the training set, since selecting the shrinkage parameter is part of the training process. The test set is there to judge the performance of the selected model. The estimated prediction error curves are shown in the figure below. Many of the curves are very flat over large ranges near their minimum. Included are estimated standard error bands for each estimated error rate, based on the ten error estimates computed by cross-validation.
We have used the “one-standard-error” rule—we pick the most parsimonious model within one standard error of the minimum (Section 7.10, page 244). Such a rule acknowledges the fact that the tradeoff curve is estimated with error, and hence takes a conservative approach. Best-subset selection chose to use the two predictors lcvol and lweight. The last two lines of the table give the average prediction error (and its estimated standard error) over the test set.
%%html
<img src="subset-cross.png" >
Estimated prediction error curves and their standard errors for the various selection and shrinkage methods. Each curve is plotted as a function of the corresponding complexity parameter for that method. The horizontal axis has been chosen so that the model complexity increases as we move from left to right. The estimates of prediction error and their standard errors were obtained by tenfold cross-validation. The least complex model within one standard error of the best is chosen, indicated by the purple vertical broken lines.
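The one-standard-error rule itself is a few lines of NumPy; the error and standard-error values below are hypothetical, not read off the figure:

```python
import numpy as np

# hypothetical CV results: one error and one standard error per model,
# ordered from least to most complex
cv_err = np.array([0.80, 0.62, 0.55, 0.54, 0.56])
cv_se  = np.array([0.05, 0.05, 0.04, 0.04, 0.05])

best = int(np.argmin(cv_err))                 # model with the minimum CV error
threshold = cv_err[best] + cv_se[best]        # one standard error above it
# most parsimonious (lowest-complexity) model whose error is within the band
chosen = int(np.min(np.where(cv_err <= threshold)[0]))
```

Here the minimum sits at index 3, but the simpler model at index 2 lands within one standard error of it and is chosen, which is exactly the conservative behavior the rule is designed for.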
10 Points for Logistic Regression accuracy and output plots.
import pandas as pd
import numpy as np
import pandas_profiling
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
df = pd.read_csv('../Assignment 4/insurance_data.csv')
df.describe()
# pandas_profiling.ProfileReport(df)
df.head()
df.info()
# Data Wrangling
df.isnull().sum()
h = df['age'].plot.hist(figsize=(20, 20))  # figsize controls the figure size; assigning h.size has no effect
x = list(df['age'])
y = list(df['bought_insurance'])
X = np.reshape(x, (-1, 1))
Y = np.reshape(y, (-1, 1))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state= 1)
X_train.shape
lr = LogisticRegression(random_state=0, solver='lbfgs')  # binary target, so no multi_class option is needed
model = lr.fit(X_train, y_train.ravel())  # ravel to a 1-D label array to avoid a shape warning
predictions = lr.predict(X_test)
print ("Score: ", model.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
plt.scatter( X_train, y_train, label='Training')
plt.scatter( X_test, y_test, color = 'red', label='Test - Actuals')
plt.scatter( X_test, predictions, color = 'orange', label='Test - Predicted')
plt.axhline(y=0, color='k', linestyle='-')
plt.axhline(y=1, color='k', linestyle='-')
plt.axhline(y=0.5, color='b', linestyle='--')
z = np.arange(df['age'].min(), df['age'].max(),0.5)
Z = np.reshape(z, (-1, 1))
predictions_2 = model.predict_proba(Z)[:,1]
plt.plot(z, predictions_2, color='red', linewidth=3)
plt.xlabel("Age")
plt.ylabel("Probability")
plt.legend( bbox_to_anchor=(1,1))
plt.show()
Discussion
data = pd.read_csv('../Assignment 4/data.csv')
data.head()
data.describe()
duration_ms: The duration of the track in milliseconds.
key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
time_signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 dB.
speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of the track in beats per minute (BPM). In musical terminology, the tempo is the speed or pace of a given piece and derives directly from the average beat duration.
data.isnull().sum()
data.info()
data.nunique()
# Label Encoder: Change Text Variables to Numerical
LE = LabelEncoder()
data['artist'] = LE.fit_transform(data['artist'])
data.head()
data.drop('song_title', axis=1, inplace=True)
data.drop('time_signature', axis=1, inplace=True) #Almost all values are same
data.head()
# Inital Scaling for data evaluation
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data_2 = data.copy()
# Standardize the wide-ranged columns so the box plots are comparable
for col in ['duration_ms', 'tempo', 'loudness', 'artist']:
    data_2[col] = sc.fit_transform(data[[col]]).ravel()
# data_2.drop('target', axis=1, inplace=True)
data_2.head()
data.plot.box(figsize=(12,8))
plt.xticks(
list(range(1, len(data.columns)+1)),
data.columns,
rotation='vertical')
data_2.plot.box(figsize=(12,8))
plt.xticks(
list(range(1, len(data_2.columns)+1)),
data_2.columns,
rotation='vertical')
data_2.columns
# Pick out features which have high correlation with the target
data_2.corr()
Features Dropped:
Features chosen: features = ["danceability", "loudness", "valence", "energy", "instrumentalness", "acousticness", "key", "speechiness", "duration_ms", "liveness"]
X = data.values[:, 1:-2]
Y = data.values[:,-1]
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size = 0.3, random_state = 0)
# Standardization: fit the scaler on the training set only, then apply the
# same fitted transform to the test set to avoid information leakage
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# The 0/1 class labels are left unscaled, as required for classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz # Import Decision Tree Classifier
import graphviz
from scipy import misc
import io
import pydotplus
import matplotlib.image as mpimg
c = DecisionTreeClassifier(min_samples_split = 100)
# min_samples_split=100 is chosen based on the data size;
# a smaller value would produce a much denser decision tree.
features = ["danceability", "loudness", "valence", "energy", "instrumentalness", "acousticness", "key", "speechiness",
"duration_ms", "liveness"]
train, test = train_test_split(data, test_size = 0.30)
Xtrain = train[features]
Ytrain = train["target"]
xtest = test[features]
ytest = test["target"]
dt = c.fit(Xtrain, Ytrain)
import os
os.environ["PATH"] += os.pathsep + r'C:\Program Files (x86)\Graphviz2.38\bin'
import matplotlib as mpl
def show_tree(tree, features, path):
    f = io.StringIO()
    export_graphviz(tree, out_file=f, feature_names=features)
    pydotplus.graph_from_dot_data(f.getvalue()).write_png(path)
    img = mpimg.imread(path)
    plt.imshow(img)
    # plt.show()
    # plt.rcParams['figure.dpi'] = 600
    # plt.rcParams['figure.figsize'] = (40, 40)
show_tree(dt, features, 'dec_tree_01.png')
y_pred = c.predict(xtest)
from sklearn import metrics
print("Accuracy: ",metrics.accuracy_score(ytest, y_pred))
print(classification_report(ytest, y_pred))
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from pprint import pprint
scalar = StandardScaler()
c = DecisionTreeClassifier(min_samples_split = 100)
# min_samples_split=100 is chosen based on the data size;
# a smaller value would produce a much denser decision tree.
features = ["danceability", "loudness", "valence", "energy", "instrumentalness", "acousticness", "key", "speechiness",
"duration_ms", "liveness"]
x = data[features]
y = data["target"]
depth = []
d_scores = []
for i in range(3, 30):
    c = DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross validation at each tree depth
    pipeline = Pipeline([('transformer', scalar), ('estimator', c)])
    scores = cross_val_score(pipeline, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i, scores.mean()))
    d_scores.append((i, scores))
pprint(depth)
for ea in d_scores:
    print("height = ", ea[0], " -> \n", ea[1], '\n')
k_fold_summary = []
for i in range(3, 15):
    c = DecisionTreeClassifier(max_depth=5)  # tree depth held fixed
    # Vary the number of folds i to see how the fold count affects the estimate
    pipeline = Pipeline([('transformer', scalar), ('estimator', c)])
    scores = cross_val_score(pipeline, X=x, y=y, cv=i, n_jobs=4)
    k_fold_summary.append((i, scores.mean()))
    # d_scores.append((i, scores))
pprint(k_fold_summary)